Abstract

This paper introduces a spelling correction
system which integrates seamlessly with
morphological analysis using a multi-tape
formalism. Handling of various Semitic
error problems is illustrated, with reference
to Arabic and Syriac examples. The model
handles errors of vocalisation, diacritics,
phonetic syncopation and morphographemic
idiosyncrasies, in addition to Damerau
errors. A complementary correction strategy
for morphologically sound but
morphosyntactically ill-formed words is outlined.

1 Introduction 

Semitic is known amongst computational linguists, 
in particular computational morphologists, for its 
highly inflexional morphology. Its root-and-pattern 
phenomenon not only poses difficulties for a mor- 
phological system, but also makes error detection 
a difficult task. This paper aims at presenting a 
morphographemic model which can cope with both 
issues. 

The following convention has been adopted. Mor- 
phemes are represented in braces, { }, surface 
(phonological) forms in solidi, / /, and orthographic 
strings in acute brackets, ( ). In examples of gram- 
mars, variables begin with a capital letter. Cs de- 
note consonants, Vs denote vowels and a bar denotes 
complement. An asterisk, *, indicates ill-formed 
strings. 

The difficulties in morphological analysis and er- 
ror detection in Semitic arise from the following 
facts: 

* Supported by a British Telecom Scholarship, ad- 
• Non-Linearity A Semitic stem consists of a
root and a vowel melody, arranged according
to a canonical pattern. For example,
Arabic /kuttib/ 'caused to write - perfect
passive' is composed from the root morpheme
{ktb} 'notion of writing' and the vowel melody
morpheme {ui} 'perfect passive'; the two are
arranged according to the pattern morpheme
{CVCCVC} 'causative'. This phenomenon is
analysed by (McCarthy, 1981) along the lines
of autosegmental phonology (Goldsmith, 1976).
The analysis appears in (1).¹

(1) Derivation of /kuttib/

      u           i         vowel melody
      |           |
  C   V   C   C   V   C     pattern
  |        \ /        |
  k         t         b     root

• Vocalisation Orthographically, Semitic texts
appear in three forms: (i) consonantal texts
do not incorporate any short vowels but
matres lectionis,² e.g. Arabic (ktb) for /katab/,
/kutib/ and /kutub/, but (kaatb) for /kaatab/
and /kaatib/; (ii) partially vocalised texts
incorporate some short vowels to clarify
ambiguity, e.g. (kutb) for /kutib/ to distinguish
it from /katab/; and (iii) vocalised texts
incorporate full vocalisation, e.g. (tadahraj) for
/tadahraj/.

¹ We have used the CV model to describe pattern
morphemes instead of prosodic terms because of its
familiarity in the computational linguistics literature.
For the use of moraic and affixational models in
handling Arabic morphology computationally, see (Kiraz, ).

² 'Mothers of reading': these are consonantal letters
which play the role of long vowels, and are represented
in the pattern morpheme by VV (e.g. /aa/, /uu/,
/ii/). Matres lectionis cannot be omitted from the
orthographic string.



• Vowel and Diacritic Shifts Semitic languages
employ a large number of diacritics to
represent inter alia short vowels, doubled
letters, and nunation.³ Most editors allow the user
to enter such diacritics above and below letters.
To speed data entry, the user usually enters the
base characters (say a paragraph) and then goes
back and enters the diacritics. A common
mistake is to place the cursor one extra position
to the left when entering diacritics. This
results in the vowels being shifted one position,
e.g. *(wkatubi) instead of (wakutib).

• Vocalisms The quality of the perfect and
imperfect vowels of the basic forms of Semitic
verbs is idiosyncratic. For example, the Syriac
root {ktb} takes the perfect vowel a, e.g.
/ktab/, while the root {nht} takes the vowel e,
e.g. /nhet/. It is common among learners to
make mistakes such as */kteb/ or */nhat/.

• Phonetic Syncopation A consonantal segment
may be omitted from the phonetic surface
form, but maintained in the orthographic
surface form. For example, Syriac (mdīnta) 'city' is
pronounced /mdita/.

• Idiosyncrasies The application of a
morphographemic rule may be constrained as to
which lexical morphemes it may or may not
apply to. For example, the glottal stop [?] at the end
of a stem may become [w] when followed by the
relative adjective morpheme {iyy}, as in Arabic
/samaa?+iyy/ → /samaawiyy/ 'heavenly', but
/hawaa?+iyy/ → /hawaa?iyy/ 'of air'.

• Morphosyntactic Issues In broken plurals,
diminutives and deverbal nouns, the user may
enter a morphologically sound, but
morphosyntactically ill-formed word. We shall discuss this
in more detail in section 4.⁴

To the above, one adds language-independent issues 
in spell checking such as the four Damerau trans- 
formations: omission, insertion, transposition and 
substitution (Damerau, 1964). 
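
As a rough illustration of the four Damerau transformations, the following Python sketch enumerates every string that lies one transformation away from a given word. The function name `damerau_neighbours` and the toy alphabet are assumptions made for this example; the system described in this paper expresses such errors as error rules rather than by enumeration.

```python
def damerau_neighbours(word, alphabet):
    """Return all strings one Damerau transformation away from `word`."""
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:k], word[k:]) for k in range(len(word) + 1)]
    omissions = {left + right[1:] for left, right in splits if right}
    insertions = {left + ch + right for left, right in splits for ch in alphabet}
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    substitutions = {
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    }
    # The word itself is not an error neighbour of itself.
    return (omissions | insertions | transpositions | substitutions) - {word}
```

A corrector built this way would keep only those neighbours of the erroneous word that the lexicon accepts.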

2 A Morphographemic Model 

This section presents a morphographemic model 
which handles error detection in non-linear strings. 

³ When indefinite, nouns and adjectives end in a
phonetic [n] which is represented in the orthographic form
by special diacritics.

⁴ For other issues with respect to syntactic
dependencies, see (Abduli, 1990).



Subsection 2.1 presents the formalism used, and sub- 
section 2.2 describes the model. 

2.1 The Formalism 

In order to handle the non-linear phenomenon of
Arabic, our model adopts the two-level formalism
presented by (Pulman and Hepple, 1993), with the
multi-tape extensions in (Kiraz, 1994). Their
formalism appears in (2).

(2) Two-Level Formalism

    LLC - Lex - RLC  {⇒, ⇔}
    LSC - Surf - RSC

where
    LLC  = left lexical context
    Lex  = lexical form
    RLC  = right lexical context
    LSC  = left surface context
    Surf = surface form
    RSC  = right surface context



The special symbol * is a wildcard matching any
context, with no length restrictions. The operator ⇔
caters for obligatory rules. A lexical string maps to
a surface string iff they can be partitioned into pairs
of lexical-surface subsequences, where each pair is
licensed by a ⇒ or ⇔ rule, and no partition violates
a ⇔ rule. In the multi-tape version, lexical
expressions (i.e. LLC, Lex and RLC) are n-tuples of
regular expressions of the form (x1, x2, ..., xn): the ith
expression refers to symbols on the ith tape; a nil
slot is indicated by ε.⁵ Another extension is giving
LLC the ability to contain ellipsis, ..., which
indicates the (optional) omission from LLC of tuples,
provided that the tuples to the left of ... are the first
to appear on the left of Lex.

In our morphographemic model, we add a similar 
formalism for expressing error rules (3). 

(3) Error Formalism

    ErrSurf ⇒ Surf
    { PLC - PRC }

where
    PLC = partition left context (has been done)
    PRC = partition right context (yet to be done)


⁵ Our implementation interprets rules directly; hence,
we allow ε. If the rules were to be compiled into
automata, a genuine symbol, e.g. 0, must be used. For the
compilation of our formalism into automata, see (Kiraz
and Grimley-Evans, 1995).



The error rules capture the correspondence
between the error surface and the correct surface, given
the surrounding partition into surface and lexical
contexts. They readily utilise the multi-tape format
and integrate seamlessly into morphological
analysis. PLC and PRC above are the left and right
contexts of both the lexical and (correct) surface levels.
Only the ⇒ operator is used (an error is not obligatory).

2.2 The Model 

2.2.1 Finding the error 

Morphological analysis is first called with the as- 
sumption that the word is free of errors. If this fails, 
analysis is attempted again without the 'no error' re- 
striction. The error rules are then considered when 
ordinary morphological rules fail. If no error rules 
succeed, or lead to a successful partition of the word, 
analysis backtracks to try the error rules at succes- 
sively earlier points in the word. 

For purposes of simplicity, and because on the
whole it is likely that words will contain no more
than one error (Damerau, 1964; Pollock and Zamora,
1983), normal 'no error' analysis usually resumes if
an error rule succeeds. The exception occurs with a
vowel shift error (§3.2.1). If this error rule succeeds,
an expectation of further shifted vowels is set up,
but no other error rule is allowed in the subsequent
partitions. For this reason rules are marked as to
whether they can occur more than once.
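
The control strategy just described can be sketched in Python as follows. This is a deliberately simplified stand-in, not the system's implementation: the lexicon is a flat set of strings rather than multi-tape lexical trees, and the two error rules (`restore_deleted`, `drop_inserted`) and all other names are hypothetical. It shows only the order of events: an error-free pass first, then error rules tried at the point where normal analysis got stuck and at successively earlier points, with at most one error per word.

```python
def furthest_point(word, lexicon):
    """Length of the longest prefix of `word` that begins some lexicon entry."""
    k = 0
    while k < len(word) and any(w.startswith(word[:k + 1]) for w in lexicon):
        k += 1
    return k


def restore_deleted(word, pos, alphabet):
    """Hypothetical error rule: a letter was dropped at `pos`."""
    return [word[:pos] + ch + word[pos:] for ch in alphabet]


def drop_inserted(word, pos, alphabet):
    """Hypothetical error rule: the letter at `pos` is spurious."""
    return [word[:pos] + word[pos + 1:]] if pos < len(word) else []


ERROR_RULES = [("deleted", restore_deleted), ("inserted", drop_inserted)]


def analyse(word, lexicon):
    """Return (analysis, rule_name) pairs; rule_name is None if error-free."""
    if word in lexicon:  # first pass: 'no error' assumption
        return [(word, None)]
    alphabet = sorted(set("".join(lexicon)))
    # Second pass: start where normal analysis failed, then back up.
    for pos in range(furthest_point(word, lexicon), -1, -1):
        hits = [
            (candidate, name)
            for name, rule in ERROR_RULES
            for candidate in rule(word, pos, alphabet)
            if candidate in lexicon  # only one error is allowed per word
        ]
        if hits:
            return hits
    return []
```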

2.2.2 Suggesting a correction 

Once an error rule is selected, the corrected sur- 
face is substituted for the error surface, and nor- 
mal analysis continues - at the same position. The 
substituted surface may be in the form of a vari- 
able, which is then ground by the normal analysis 
sequence of lexical matching over the lexicon tree. 
In this way only lexical words are considered, as 
the variable letter can only be instantiated to letters 
branching out from the current position on the
lexicon tree. Normal Prolog backtracking to explore
alternative rules/lexical branches applies throughout.
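
To make the role of the lexicon tree concrete, here is a minimal Python sketch of grounding a variable letter against a tree of lexical entries. The names `build_tree`, `walk` and `instantiate`, and the single-tape dictionary representation, are assumptions for illustration only; the system itself uses several lexical tapes and Prolog backtracking.

```python
END = "$"  # marks the end of a lexical entry in the tree


def build_tree(words):
    """Build a letter tree (trie) as nested dictionaries."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def walk(node, text):
    """Follow `text` from `node`; return the node reached, or None."""
    for ch in text:
        node = node.get(ch)
        if node is None:
            return None
    return node


def instantiate(tree, word, pos):
    """Ground a variable letter inserted at `pos`, using only letters that
    branch out of the lexicon tree at that position."""
    here = walk(tree, word[:pos])
    if here is None:
        return []
    results = []
    for letter, child in sorted(here.items()):
        if letter == END:
            continue
        rest = walk(child, word[pos:])
        if rest is not None and END in rest:
            results.append(word[:pos] + letter + word[pos:])
    return results
```

Because candidate letters are read off the tree, a letter that cannot continue any lexical entry is never proposed.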

3 Error Checking in Arabic 

We demonstrate our model on the Arabic verbal 
stems shown in (4) (McCarthy, 1981). Verbs are 
classified according to their measure (M): there
are 15 triliteral measures and 4 quadriliteral ones.
Moving horizontally across the table, one notices a 
change in vowel melody (active {a}, passive {ui}); 
everything else remains invariant. Moving vertically, 
a change in canonical pattern occurs; everything else 
remains invariant. 



Subsection 3.1 presents a simple two-level gram- 
mar which describes the above data. Subsection 3.2 
presents error checking. 

(4) Arabic Verbal Stems

    Measure   Active      Passive
    1         katab       kutib
    2         kattab      kuttib
    3         kaatab      kuutib
    4         ?aktab      ?uktib
    5         takattab    tukuttib
    6         takaatab    tukuutib
    7         nkatab      nkutib
    8         ktatab      ktutib
    9         ktabab
    10        staktab     stuktib
    11        ktaabab
    12        ktawtab
    13        ktawwab
    14        ktanbab
    15        ktanbay
    Q1        dahraj      duhrij
    Q2        tadahraj    tuduhrij
    Q3        dhanraj     dhunrij
    Q4        dharjaj     dhurjij



3.1 Two-Level Rules 

The lexical level maintains three lexical tapes (Kay,
1987; Kiraz, 1994): pattern tape, root tape and
vocalism tape; each tape scans a lexical tree.
Examples of pattern morphemes are: {c1v1c2v1c3} (M 1),
{c1c2v1nc3v2c4} (M Q3). The root morphemes are
{ktb} and {dhrj}, and the vocalism morphemes are
{a} (active) and {ui} (passive).

The following two-level grammar handles the
above data. Each lexical expression is a triple;
lexical expressions with one symbol assume ε on the
remaining positions.

(5) General Rules

    R0:  * - X - *  ⇒
         * - X - *

    R1:  * - (Pc, C, ε) - *  ⇒
         * - C - *

    R2:  * - (Pv, ε, V) - *  ⇒
         * - V - *

    where Pc ∈ {c1, c2, c3, c4},
          Pv ∈ {v1, v2}



(5) gives three general rules: R0 allows any
character on the first lexical tape to surface, e.g.
infixes, prefixes and suffixes. R1 states that any P ∈
{c1, c2, c3, c4} on the first (pattern) tape and C
on the second (root) tape with no transition on the
third (vocalism) tape corresponds to C on the
surface tape; this rule sanctions consonants. Similarly,
R2 states that any P ∈ {v1, v2} on the pattern tape
and V on the vocalism tape with no transition on the
root tape corresponds to V on the surface tape; this
rule sanctions vowels.



(6) Boundary Rules

    R3:  (B, ε, ε) - + - *  ⇒
         * - ε - *

Resuming the description of the grammar, (8)
presents spreading rules. Notice the use of ellipsis
to indicate that there can be tuples separating Lex
and LLC, as long as the tuples in LLC are the nearest
ones to Lex. R5 sanctions the spreading (and
gemination) of consonants. R6 sanctions the spreading
of the first vowel. Spreading examples appear in (9).



(9) Derivation of M 1 - M 3

a. /katab/

        a               +   VT
    k       t       b   +   RT
    c1  v1  c2  v1  c3  +   PT
    1   2   1   6   1   4
    k   a   t   a   b       ST


    R4:  (B, *, *) - (+, +, +) - *  ⇒
         * - ε - *

    where B ≠ +

(6) gives two boundary rules: R3 is used for non- 
stem morphemes, e.g. prefixes and suffixes. R4 ap- 
plies to stem morphemes reading three boundary 
symbols simultaneously; this marks the end of a 
stem. Notice that LLC ensures that the right bound- 
ary rule is invoked at the right time. 

Before embarking on the rest of the rules, an
illustrated example seems in order. The derivation
of /dhunrija/ (M Q3, passive), from the three
morphemes {c1c2v1nc3v2c4}, {dhrj} and {ui}, and the
suffix {a} '3rd person' is illustrated in (7).



(7) Derivation of M Q3 + {a}

            u           i       +           vocalism tape
    d   h           r       j   +           root tape
    c1  c2  v1  n   c3  v2  c4  +   a   +   pattern tape
    1   1   2   0   1   2   1   4   0   3
    d   h   u   n   r   i   j       a       surface tape



The numbers between the surface tape and the 
lexical tapes indicate the rules which sanction the 
moves. 



(8) Spreading Rules

    R5:  (P1, C, ε) ... - P1 - *  ⇒
         * - C - *

    R6:  (v1, ε, V) ... - v1 - *  ⇒
         * - V - *

    where P1 ∈ {c2, c3, c4}



b. /kattab/

        a                   +   VT
    k       t           b   +   RT
    c1  v1  c2  c2  v1  c3  +   PT
    1   2   1   5   6   1   4
    k   a   t   t   a   b       ST

c. /kaatab/

        a                   +   VT
    k           t       b   +   RT
    c1  v1  v1  c2  v1  c3  +   PT
    1   2   6   1   6   1   4
    k   a   a   t   a   b       ST
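
The derivations in (9) can be mimicked with a short Python sketch. It is a simplification under stated assumptions: the function name `derive_stem` is hypothetical, the pattern is given as a list of symbols, and spreading is modelled simply as "a pattern symbol seen before reuses its segment", which ignores the contextual conditions of R5 and R6.

```python
CONSONANT_SLOTS = {"c1", "c2", "c3", "c4"}
VOWEL_SLOTS = {"v1", "v2"}


def derive_stem(pattern, root, vocalism):
    """Combine a pattern (list of symbols), a root and a vocalism."""
    filled = {}  # pattern symbol -> surface segment already assigned
    consonants = iter(root)
    vowels = iter(vocalism)
    surface = []
    for symbol in pattern:
        if symbol in filled:
            # Spreading (cf. R5, R6): reuse the segment of the same symbol.
            surface.append(filled[symbol])
        elif symbol in CONSONANT_SLOTS:
            # cf. R1: pattern consonant slot takes the next root consonant.
            filled[symbol] = next(consonants)
            surface.append(filled[symbol])
        elif symbol in VOWEL_SLOTS:
            # cf. R2: pattern vowel slot takes the next vocalism vowel.
            filled[symbol] = next(vowels)
            surface.append(filled[symbol])
        else:
            # cf. R0: any other pattern-tape character surfaces as itself.
            surface.append(symbol)
    return "".join(surface)
```

For instance, `derive_stem("c1 v1 c2 c2 v1 c3".split(), "ktb", "a")` yields the M 2 active stem of (4).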



The following rules allow for the different possible
orthographic vocalisations in Semitic texts:

    R7:  ¬(V, ε, ε) - (V, ε, ε) - ¬(V, ε, ε)  ⇒
         * - ε - *

    R8:  (Pc1, C1, ε) - (P, ε, V) - (Pc2, C2, ε)  ⇒
         * - ε - *

    R9:  λ - (v1, ε, ε) - ρ  ⇒
         * - ε - *

where ¬ marks the complement of a context,
λ = (v1, ε, V) ... (Pc1, C1, ε) and ρ = (Pc2, C2, ε).

R7 and R8 allow the optional deletion of short
vowels in non-stem and stem morphemes,
respectively; note that the lexical contexts make sure that
long vowels are not deleted. R9 allows the optional
deletion of a short vowel which is the result of
spreading. For example, the rules sanction both /katab/
(M 1, active) and /kutib/ (M 1, passive) as
interpretations of (ktb), as shown in (10).
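
The effect of optional short-vowel deletion can be approximated with a regular expression, as in the Python sketch below. The assumptions are ours: vowels are the transcription letters a, i, u; a long vowel is a doubled vowel letter; and the helper names are hypothetical. Each short vowel becomes optional while long vowels and consonants stay obligatory.

```python
import re

VOWELS = set("aiu")


def vocalisation_regex(full):
    """Regex accepting every orthographic form of a fully vocalised string."""
    parts = []
    for k, ch in enumerate(full):
        prev_same = k > 0 and full[k - 1] == ch
        next_same = k + 1 < len(full) and full[k + 1] == ch
        if ch in VOWELS and not (prev_same or next_same):
            parts.append(re.escape(ch) + "?")  # short vowel: may be omitted
        else:
            parts.append(re.escape(ch))  # consonant or part of a long vowel
    return re.compile("".join(parts))


def is_vocalisation_of(orthographic, full):
    """True if `orthographic` is `full` with some short vowels left out."""
    return vocalisation_regex(full).fullmatch(orthographic) is not None
```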

3.2 Error Rules 

Below are outlined error rules resulting from
peculiarly Semitic problems. Error rules can also be
constructed in a similar vein to deal with typographical
Damerau errors (which also take care of the issue of
wrong vocalisms).
(10) Two-Level Derivation of M 1

a. /katab/

        a               +   VT
    k       t       b   +   RT
    c1  v1  c2  v1  c3  +   PT
    1   8   1   9   1   4
    k       t       b       ST

b. /kutib/

        u       i       +   VT
    k       t       b   +   RT
    c1  v1  c2  v1  c3  +   PT
    1   8   1   9   1   4
    k       t       b       ST


3.2.1 Vowel Shift 

A vowel shift error rule will be tried with a
partition on a (short) vowel which is not an expected
(lexical) vowel at that position. Short vowels can
legitimately be omitted from an orthographic
representation - it is this fact which contributes to the problem
of vowel shifts. A vowel is considered shifted if the
same vowel has been omitted earlier in the word.
The rule deletes the vowel from the surface. Hence
in the next pass of (normal) analysis, the partition
is analysed as a legitimate omission of the expected
vowel. This prepares for the next shifted vowel to
be treated in exactly the same way as the first. The
expectation of this reapplication is allowed for in
reap = y.



(11)

    E0:  X ⇒ ε    where reap = y
         { [om_stmv, ε, (*, *, X)] ... - * }

    E1:  X ⇒ ε    where reap = y
         { [*, *, (v1, ε, X)] ... [om_sprv, ε, (*, *, ε)] ... - * }



In the rules above, 'X' is the shifted vowel. It is
deleted from the surface. The partition contextual
tuples consist of [Rule Name, Surf, Lex]. The
Lex element is a tuple itself of [Pattern, Root,
Vocalism]. In E0 the shifted vowel was analysed
earlier as an omitted stem vowel (om_stmv), whereas
in E1 it was analysed earlier as an omitted spread
vowel (om_sprv). The surface/lexical restrictions in
the contexts could be written out in more detail, but
both rules make use of the fact that those contexts
are analysed by other partitions, which check that
they meet the conditions for an omitted stem vowel
or omitted spread vowel.



For example, *(dhruji) will be interpreted as
(duhrij). The 'E0's on the rule number line indicate
where the vowel shift rule was applied to replace an
error surface vowel with ε. The error surface vowels
are written in italics.
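
The idea behind the vowel shift rules can be sketched in Python by aligning an erroneous string with one candidate lexical form. This is an illustrative simplification, not the two-level implementation: the name `count_shifted_vowels` is hypothetical, vowels are assumed to be a, i, u, and long vowels receive no special treatment. A lexical vowel missing from the input is recorded as omitted; a later unexpected vowel in the input counts as shifted only if it equals the earliest omitted vowel not yet accounted for.

```python
SHORT_VOWELS = set("aiu")


def count_shifted_vowels(err, lexical):
    """Number of shifted vowels if `err` aligns with `lexical`, else None."""
    pending = []  # lexical vowels omitted so far and not yet matched
    shifted = 0
    i = j = 0  # i indexes err, j indexes lexical
    while i < len(err) or j < len(lexical):
        e = err[i] if i < len(err) else None
        l = lexical[j] if j < len(lexical) else None
        if e is not None and e == l:
            i += 1  # ordinary match
            j += 1
        elif e in SHORT_VOWELS and pending and pending[0] == e:
            pending.pop(0)  # shifted vowel: delete it from the surface
            shifted += 1
            i += 1
        elif l in SHORT_VOWELS:
            pending.append(l)  # expected vowel legitimately omitted
            j += 1
        else:
            return None  # no alignment under these assumptions
    return shifted
```

A result of 0 means the input is merely a partially vocalised spelling of the lexical form, with no shift involved.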



(12) Two-Level Analysis of *(dhruji)

        u               i               +   VT
    d       h   r           j           +   RT
    c1  v1  c2  c3      v2  c4          +   PT
    1   8   1   1   E0  8   1   E0  4
    d       h   r   u       j   i           ST



3.2.2 Deleted Consonant 

Problems resulting from phonetic syncopation can
be treated as accidental omission of a consonant,
e.g. *(mdīta) for (mdīnta).

(13)

    E2:  ε ⇒ X    where cons(X), reap = n
         { * - * }



3.2.3 Deleted Long Vowel 

Although the error probably results from a differ- 
ent fault, a deleted long vowel can be treated in the 
same way as a deleted consonant. With current tran- 
scription practice, long vowels are commonly written 
as two characters - they are possibly better repre- 
sented as a single, distinct character. 



(14)

    E3:  ε ⇒ XX    where vowel(X), reap = n
         { * - * }



The form *(tuktib) can be interpreted as either 
(tukuttib) with a deleted consonant (geminated 't') 
or (tukuutib) with a deleted long vowel. 
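
The ambiguity of *(tuktib) can be reproduced with a small Python sketch that combines one deletion (of a consonant, as in E2, or of a long vowel, as in E3) with optional omission of short vowels. All names here are hypothetical, the lexicon is a plain list of fully vocalised forms, and vowels are assumed to be a, i, u with long vowels written as doubled letters.

```python
import re

VOWELS = set("aiu")


def segments(word):
    """Split a word into (text, kind) with kind 'C', 'V' or 'VV'."""
    result = []
    k = 0
    while k < len(word):
        ch = word[k]
        if ch in VOWELS and k + 1 < len(word) and word[k + 1] == ch:
            result.append((ch * 2, "VV"))  # long vowel
            k += 2
        else:
            result.append((ch, "V" if ch in VOWELS else "C"))
            k += 1
    return result


def pattern_of(segs):
    """Regex in which only short vowels are optional."""
    return "".join(
        re.escape(text) + ("?" if kind == "V" else "") for text, kind in segs
    )


def repairs(err, lexicon):
    """Lexical forms that yield `err` after one E2- or E3-style deletion."""
    found = set()
    for word in lexicon:
        segs = segments(word)
        for k, (_, kind) in enumerate(segs):
            if kind == "V":
                continue  # short-vowel omission is not an error
            rest = segs[:k] + segs[k + 1:]
            if re.fullmatch(pattern_of(rest), err):
                found.add((word, "E2" if kind == "C" else "E3"))
    return found
```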



(15) Two-Level Analysis of *(tuktib)

a. M 5

        u                       i       +   VT
            k       t               b   +   RT
    t   v1  c1  v1  c2      c2  v2  c3  +   PT
    0   2   1   9   1   E2  1   2   1   4
    t   u   k       t       t   i   b       ST

b. M 6

        u                       i       +   VT
            k               t       b   +   RT
    t   v1  c1      v1  v1  c2  v2  c3  +   PT
    0   2   1   E3  6   6   1   2   1   4
    t   u   k       u   u   t   i   b       ST



3.2.4 Substituted Consonant 

One type of morphographemic error is that
consonant substitution may not take place before
appending a suffix. For example /samaa?/ 'heaven' + {iyy}
'relative adjective' surfaces as (samaawiyy), where
? → w in the given context. A common mistake is to
write it as *(samaa?iyy).

(16)

    E4:  ? ⇒ w    where reap = n
         { * - [glottal_change, w, (Pc, ?, ε)] }

The 'glottal_change' rule would be a normal
morphological spelling change rule, incorporating
contextual constraints (e.g. for the morpheme
boundary) as necessary.
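
A minimal Python sketch of this correction follows. It assumes a hypothetical lexical marking, `GLOTTAL_TO_W`, recording for each stem whether its final glottal stop changes to w before {iyy}; the two entries come from the examples given earlier. The sketch also accepts the reverse confusion (w written where ? is required), which goes beyond what E4 itself states.

```python
# Hypothetical lexical marking: does the stem-final glottal stop become w
# before the relative adjective suffix {iyy}?
GLOTTAL_TO_W = {"samaa?": True, "hawaa?": False}
SUFFIX = "iyy"


def relative_adjective(stem):
    """Generate the relative adjective of a stem."""
    if stem.endswith("?") and GLOTTAL_TO_W.get(stem, False):
        return stem[:-1] + "w" + SUFFIX
    return stem + SUFFIX


def correct_relative_adjective(word):
    """Return the well-formed relative adjective, or None if no stem fits."""
    if not word.endswith(SUFFIX):
        return None
    base = word[: -len(SUFFIX)]
    for stem in GLOTTAL_TO_W:
        # Accept the stem as written or with its final ? replaced by w.
        if base in (stem, stem[:-1] + "w"):
            return relative_adjective(stem)
    return None
```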

4 Broken Plurals, Diminutive and Deverbal Nouns

This section deals with morphosyntactic errors 
which are independent of the two-level analy- 
sis. The data described below was obtained from 
Daniel Ponsford (personal communication), based 
on (Wehr, 1971). 

Recall that a Semitic stem consists of a root
morpheme and a vocalism morpheme arranged
according to a canonical pattern morpheme. As each root
does not occur in all vocalisms and patterns, each 
lexical entry is associated with a feature structure 
which indicates inter alia the possible patterns and 
vocalisms for a particular root. Consider the nomi- 
nal data in (17). 

(17) Broken Plurals

    Singular   Plural Forms
    kadis      kuds, *kidaas
    kaafil     kuffal, *kufalaa?, *kuffaal
    kafiil     kufalaa?
    sahm       *?ashaam, suhuum, ?ashum

Patterns marked with * are morphologically plausi- 
ble, but do not occur lexically with the cited nouns. 
A common mistake is to choose the wrong pattern. 

In such a case, the two-level model succeeds in 
finding two-level analyses of the word in question, 
but fails when parsing the word morphosyntacti- 
cally: at this stage, the parser is passed a root, vo- 
calism and pattern whose feature structures do not 
unify. 

Usually this feature-clash situation creates the
problem of which constituent to give preference to
(Langer, 1990). Here the vocalism indicates the
inflection (e.g. broken plural) and the preference of
vocalism pattern for that type of inflection belongs
to the root. For example *(kidaas) would be
analysed as root {kds} with a broken plural vocalism.
The pattern type of the vocalism clashes with the
broken plural pattern that the root expects. To
correct, the morphological analyser is executed in
generation mode to generate the broken plural form of
{kds} in the normal way.

The same procedure can be applied on diminutive 
and deverbal nouns. 
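
The correction step can be pictured with the following Python sketch, which stands in for feature-structure unification with a simple lookup. The table `BROKEN_PLURALS` is taken from the unstarred forms in (17); keying it by the singular, and the function name `check_plural`, are assumptions made for brevity, whereas the system identifies the root through two-level analysis.

```python
# Licensed broken plurals, taken from the unstarred forms in (17).
BROKEN_PLURALS = {
    "kadis": ["kuds"],
    "kaafil": ["kuffal"],
    "kafiil": ["kufalaa?"],
    "sahm": ["suhuum", "?ashum"],
}


def check_plural(singular, attempted):
    """Return the attempted plural if licensed, else the licensed forms."""
    licensed = BROKEN_PLURALS.get(singular, [])
    if attempted in licensed:
        return [attempted]  # features unify: nothing to correct
    # Feature clash: fall back to generating what the lexical entry expects.
    return list(licensed)
```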

5 Conclusion 

The model presented corrects errors resulting from
combining nonconcatenative strings as well as more
standard morphological or spelling errors. It
covers Semitic errors relating to vocalisation,
diacritics, phonetic syncopation and morphographemic
idiosyncrasies. Morphosyntactic issues of broken
plurals, diminutives and deverbal nouns can be handled
by a complementary correction strategy which also
depends on morphological analysis.

Other than the economic factor, an important ad- 
vantage of combining morphological analysis and er- 
ror detection/correction is the way the lexical tree 
associated with the analysis can be used to deter- 
mine correction possibilities. The morphological 
analysis proceeds by selecting rules that hypothesise 
lexical strings for a given surface string. The rules 
are accepted/rejected by checking that the lexical 
string(s) can extend along the lexical tree(s) from 
the current position(s). Variables introduced by er- 
ror rules into the surface string are then instantiated 
by associating surface with lexical, and matching 
lexical strings to the lexicon tree(s). The system is 
unable to consider correction characters that would 
be lexical impossibilities. 